A New Universal Code Helps to Distinguish Natural Language from Random Texts
Abstract
Using a new universal distribution called the switch distribution, we reveal a prominent statistical difference between a text in natural language and its unigram version. For the text in natural language, the cross mutual information grows as a power law, whereas for the unigram text, it grows logarithmically. In this way, we corroborate Hilberg’s conjecture and disprove an alternative hypothesis that texts in natural language are generated by the unigram model.
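The comparison described in the abstract can be illustrated with a rough sketch. It is not the paper's method: the switch distribution is replaced here by `zlib` as a stand-in compressor, and the helper names (`code_length`, `block_mutual_information`, `unigram_version`) are hypothetical. The sketch assumes that the mutual information between two adjacent blocks of n words can be approximated by the code lengths of the two blocks minus the code length of their concatenation, and that a unigram version of a text is an i.i.d. sample drawn from its empirical word frequencies.

```python
import random
import zlib
from collections import Counter


def code_length(words):
    """Code length in bits of a word sequence.

    Assumption: zlib stands in for a universal code; the paper itself
    uses the switch distribution, which is not implemented here.
    """
    data = " ".join(words).encode("utf-8")
    return 8 * len(zlib.compress(data, 9))


def block_mutual_information(words, n):
    """Estimate C(first n words) + C(next n words) - C(first 2n words)."""
    if 2 * n > len(words):
        raise ValueError("need at least 2 * n words")
    left, right = words[:n], words[n:2 * n]
    return code_length(left) + code_length(right) - code_length(left + right)


def unigram_version(words, seed=0):
    """Draw an i.i.d. text of the same length from the empirical word frequencies."""
    rng = random.Random(seed)
    counts = Counter(words)
    vocab = list(counts)
    weights = [counts[w] for w in vocab]
    return rng.choices(vocab, weights=weights, k=len(words))


if __name__ == "__main__":
    # Toy input only; a real comparison would need a long natural-language text.
    text = ("the cat sat on the mat and the dog sat on the log " * 200).split()
    shuffled = unigram_version(text)
    for n in (50, 100, 200, 400):
        print(n, block_mutual_information(text, n), block_mutual_information(shuffled, n))
```

Plotting the two estimates against n on logarithmic axes is one way to look for the contrast the abstract reports, although an off-the-shelf compressor with a bounded window may blur it.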
Similar resources
The Relaxed Hilberg Conjecture: A Review and New Experimental Support
The relaxed Hilberg conjecture states that the mutual information between two adjacent blocks of text in natural language grows as a power of the block length. The present paper reviews recent results concerning this conjecture. First, the relaxed Hilberg conjecture occurs when the texts repeatedly describe a random reality and Herdan’s law for facts repeatedly described in the texts is obeyed....
Hilberg’s Conjecture — a Challenge for Machine Learning
We review three mathematical developments linked with Hilberg’s conjecture—a hypothesis about the power-law growth of entropy of texts in natural language, which sets up a challenge for machine learning. First, considerations concerning maximal repetition indicate that universal codes such as the Lempel-Ziv code may fail to efficiently compress sources that satisfy Hilberg’s conjecture. Second,...
Predicate Preserving Parsing
We present a unique approach to knowledge extraction from texts by a method of natural language analysis which preserves the predicate till the end. The system thus named Predicate Preserving Parser (PPP) performs morphological, syntactic and semantic analysis synchronously. This approach helps in highly accurate analysis of sentences. The analysis produces a semantic net like structure express...
On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts
The article presents a new interpretation for Zipf’s law in natural language which relies on two areas of information theory. Firstly, we reformulate the problem of grammar-based compression and, secondly, we investigate properties of strongly nonergodic stationary processes. The motivation for the joint discussion is to prove a proposition with a simple informal statement: If an n-letter long ...
UNL Based Bangla Natural Text Conversion - Predicate Preserving Parser Approach
Universal Networking Language (UNL) is a declarative formal language that is used to represent semantic data extracted from natural language texts. This paper presents a novel approach to converting Bangla natural language text into UNL using a method known as Predicate Preserving Parser (PPP) technique. PPP performs morphological, syntactic and semantic, and lexical analysis of text synchronou...
Publication date: 2015